A Preliminary Approach to the Multilabel Classification Problem of Portuguese Juridical Documents
Abstract
Portuguese juridical documents from Supreme Courts and the Attorney General’s Office are manually classified by juridical experts into a set of classes belonging to a taxonomy of concepts. This paper proposes a preliminary approach to classifying these documents automatically. The basic strategy integrates natural language processing techniques with machine learning. Support Vector Machines (SVM) are used as the learning algorithm, and the results are presented and compared with those of other approaches, such as C4.5 and Naive Bayes.
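An SVM is a binary classifier, so a multilabel problem is commonly decomposed into one binary problem per class before training. The sketch below illustrates that decomposition together with a simple stop-word filter and bag-of-words features. It is an illustration under stated assumptions, not the authors' pipeline: the toy corpus, the label names, the stop-word list, and the helper names `tokenize`, `bag_of_words`, and `binary_relevance_targets` are all hypothetical, and the abstract does not say which decomposition the authors used.

```python
from collections import Counter

# Hypothetical toy corpus: (text, set of taxonomy labels). Not from the paper.
DOCS = [
    ("o contrato de trabalho foi rescindido", {"labour"}),
    ("o imposto sobre o rendimento", {"tax"}),
    ("imposto devido pelo contrato de trabalho", {"labour", "tax"}),
]

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"o", "a", "de", "foi", "sobre", "pelo"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def bag_of_words(tokens, vocabulary):
    """Map a token list to a term-count vector ordered by `vocabulary`."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def binary_relevance_targets(label_sets):
    """Turn label sets into one +1/-1 target vector per class."""
    classes = sorted(set().union(*label_sets))
    return {c: [1 if c in labels else -1 for labels in label_sets] for c in classes}


tokenized = [tokenize(text) for text, _ in DOCS]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
features = [bag_of_words(tokens, vocabulary) for tokens in tokenized]
targets = binary_relevance_targets([labels for _, labels in DOCS])
```

Each entry of `targets` paired with `features` is an ordinary binary training set, which could then be handed to any binary learner such as a linear SVM. A document's predicted label set is the set of classes whose classifier answers positively.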
Related resources
The impact of NLP techniques in the multilabel text classification problem
Support Vector Machines have been used successfully to classify text documents into sets of concepts. However, typically, linguistic information is not being used in the classification process or its use has not been fully evaluated. We apply and evaluate two basic linguistic procedures (stop-word removal and stemming/lemmatization) to the multilabel text classification problem. These procedure...
A Question Answer System for Legal Information Retrieval
In this paper we present a question-answering system for Portuguese juridical documents. The system has two modules: preliminary analysis of documents (information extraction) and query processing (information retrieval). The proposed approach is based on computational linguistic theories: syntactic analysis (constraint grammars), followed by semantic analysis using the discourse representati...
Multilabel Classification of Documents with Mapreduce
Multilabel classification is the problem of assigning a set of positive labels to an instance. It is increasingly needed in applications such as protein function classification, music categorization, gene classification, and document classification, where it eases the identification and retrieval of information. Labeling web documents manually is a time-consuming and difficult task due t...
Analysing Part-of-Speech for Portuguese Text Classification
This paper proposes and evaluates the use of linguistic information in the pre-processing phase of text classification. We present several experiments evaluating the selection of terms based on different measures and linguistic knowledge. To build the classifier we used Support Vector Machines (SVM), which are known to produce good results on text classification tasks. Our proposals were applie...
Efficient Multilabel Classification Algorithms for Large-Scale Problems in the Legal Domain
In this paper we evaluate the performance of multilabel classification algorithms on the EUR-Lex database of legal documents of the European Union. On the same set of underlying documents, we defined three different large-scale multilabel problems with up to 4000 classes. On these datasets, we compared three algorithms: (i) the well-known one-against-all approach (OAA); (ii) the multiclass mult...